AI Ram Ollama Continue Dev

A llamafile is just a single executable file that bundles the llama.cpp engine with model weights.

However, for a 30GB+ model like our Q8_0 GGUF, creating a 30GB executable is impractical. The real power-user workflow, which perfectly suits your goal, is to use the llamafile executable as a portable server and tell it to load an external GGUF file.

This gives you the best of all worlds:

  • Portability: A single, ~15MB llamafile executable that you can drop on any (Linux/macOS/Windows) machine.
  • Power: You can load any GGUF file you want, including our 30GB gemma-3-27b-it-q8_0.
  • Performance: You can pass all the llama.cpp performance flags (--mlock, --threads, --n-gpu-layers) directly to it.
  • Customization: You can apply LoRA adapters on the fly.
  • Integration: It starts an OpenAI-compatible server, which is exactly what tools like Continue.dev need.

Here is the complete walkthrough to create the ultimate, portable, high-performance coding experience.


## The "Portable Powerhouse" llamafile Walkthrough

### Phase 1: Acquire Your "Engine" and "Fuel"

We need two things: the llamafile executable (the engine) and our high-fidelity model (the fuel).

  1. Download the llamafile Executable: Go to the llamafile GitHub releases page and download the bare llamafile-0.8.6 (or newer) executable; we don't need one with a model bundled in.

    # Create a workspace
    mkdir -p ~/llamafile-power-setup
    cd ~/llamafile-power-setup
    
    # Download the llamafile binary
    wget https://github.com/Mozilla-Ocho/llamafile/releases/download/0.8.6/llamafile-0.8.6
    
    # Make it executable (Linux/macOS)
    chmod +x llamafile-0.8.6
    
    # On Windows, you would just rename it to "llamafile-0.8.6.exe"

    You now have your portable engine.

  2. Download the High-Fidelity GGUF Model: This is the same as our previous step. We'll download the 30GB Q8_0 model and place it in a models folder. (A quick sanity check on both downloads follows right after this step.)

    # Install the Hugging Face CLI (it ships with the huggingface_hub package)
    pip install -U "huggingface_hub[cli]"
    
    # Create a models directory
    mkdir -p ./models
    
    # Download our 30GB Q8_0 model
    huggingface-cli download \
      paultimothymooney/gemma-3-27b-it-Q8_0-GGUF \
      gemma-3-27b-it-q8_0.gguf \
      --local-dir ./models \
      --local-dir-use-symlinks False
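
Before moving on, it's worth a quick optional sanity check that both pieces are in place. This is a minimal sketch; it assumes the --help flag llamafile inherits from llama.cpp and the paths used above:

# Confirm the engine runs at all (prints usage text) and that the model file landed
./llamafile-0.8.6 --help | head -n 5
ls -lh ./models/gemma-3-27b-it-q8_0.gguf   # should show a file of roughly 30GB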

### Phase 2: Launch the Server with Max Performance Flags

This is the core of the setup. We will run our llamafile executable and pass it all the high-performance llama.cpp flags.

First, find your PHYSICAL core count (e.g., sysctl -n hw.physicalcpu on macOS; on Linux, multiply the "Core(s) per socket" and "Socket(s)" values from lscpu). We'll use 8 cores as our example.
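
If you'd rather script that than eyeball it, here is a small sketch (the lscpu parsing is just one way to count unique core/socket pairs on Linux; on macOS the sysctl call above is all you need):

# Count PHYSICAL cores (not hyperthreads) and reuse the value in the launch command
# macOS:
#   PHYS_CORES=$(sysctl -n hw.physicalcpu)
# Linux: count unique (core, socket) pairs reported by lscpu
PHYS_CORES=$(lscpu -p=Core,Socket | grep -v '^#' | sort -u | wc -l)
echo "Physical cores: $PHYS_CORES"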

Here is the full launch command:

# Run this command from your '~/llamafile-power-setup' directory.
# It launches the OpenAI-compatible server.
#
# Flag guide:
#   --server / --port / --host   serve the API on 0.0.0.0:8080
#   -m                           path to the external GGUF model
#   -c                           context window size (tokens)
#   --mlock                      force the model into RAM; critical for performance
#   -t                           thread count; set to your PHYSICAL core count
#   --n-gpu-layers 99            offload up to 99 layers to the GPU; essential if you
#                                have one, set to 0 (or just omit it) if you are CPU-only

./llamafile-0.8.6 \
    --server \
    --port 8080 \
    --host 0.0.0.0 \
    -m ./models/gemma-3-27b-it-q8_0.gguf \
    -c 131072 \
    --mlock \
    -t 8 \
    --n-gpu-layers 99

Your terminal will now show server logs. You have a high-performance, OpenAI-compatible API running at http://127.0.0.1:8080.
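
Before wiring up any tools, you can smoke-test the API from another terminal. This assumes the standard OpenAI-style /v1/chat/completions route that llama.cpp's server exposes; the model field is only a label, and the API key can be any non-empty string:

# Quick smoke test against the OpenAI-compatible endpoint
curl http://127.0.0.1:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer llamafile" \
  -d '{
        "model": "gemma-3-27b-it-q8_0",
        "messages": [
          {"role": "user", "content": "Write a one-line hello world in Rust."}
        ]
      }'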

### Phase 3: Customize for Code (Apply LoRA)

This is what makes the llamafile server so powerful. You don't need to build a new model. You can "hot-swap" a LoRA by just adding a flag at launch.

Let's assume you've downloaded a rust-code-lora.gguf into your ./models folder.

You would simply add one flag to the launch command:

# Same launch command as before, with the LoRA flag added at the end
./llamafile-0.8.6 \
    --server \
    --port 8080 \
    -m ./models/gemma-3-27b-it-q8_0.gguf \
    -c 131072 \
    --mlock \
    -t 8 \
    --n-gpu-layers 99 \
    --lora ./models/rust-code-lora.gguf

Now, the server running at http://127.0.0.1:8080 serves your gemma-q8 model, specialized for Rust. You can keep multiple small launch scripts for different "specialist" servers.
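
For example, a tiny wrapper script per specialist keeps the flags in one place. This is a hypothetical sketch (the script name, port argument, and LoRA path are placeholders; adjust them to your files):

#!/usr/bin/env bash
# start-specialist.sh (hypothetical): launch one "specialist" server
# Usage: ./start-specialist.sh ./models/rust-code-lora.gguf 8080
set -euo pipefail
cd "$(dirname "$0")"

LORA="$1"          # path to the LoRA GGUF for this specialist
PORT="${2:-8080}"  # port for this server instance

./llamafile-0.8.6 \
    --server \
    --port "$PORT" \
    -m ./models/gemma-3-27b-it-q8_0.gguf \
    -c 131072 \
    --mlock \
    -t 8 \
    --n-gpu-layers 99 \
    --lora "$LORA"

Launch whichever specialist you need at the moment; with --mlock, each instance pins the full 30GB model in RAM, so running several at once requires correspondingly more memory.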

### Phase 4: Integrate with Continue.dev for the Best Code Experience

This is the final step. We will point Continue.dev at our new, high-performance llamafile server.

Continue.dev can connect to any OpenAI-compatible API.

  1. Open VS Code and go to your ~/.continue/config.yaml file.

  2. Paste in this configuration:

    models:
      - name: "Gemma 3 27B Coder (llamafile)"

        # We use the "openai" provider, as llamafile emulates the OpenAI API
        provider: "openai"

        # This is just a label for the API; it can be anything
        model: "gemma-3-27b-it-q8_0"

        # --- The Integration ---
        # Point to your local llamafile server
        apiBase: "http://127.0.0.1:8080/v1"

        # The API key can be any non-empty string
        apiKey: "llamafile"

        # --- Make this the default model for chat and edits ---
        roles:
          - chat
          - edit

      - name: "MixedBread Embed Large"
        provider: "ollama"
        model: "mxbai-embed-large-v1:latest"
        apiBase: "http://localhost:11434"

        # Use this model for codebase embeddings
        roles:
          - embed

    # ... (rest of your config: context providers, etc.) ...

    (Note: This setup still uses your Ollama server for embeddings, as it's the simplest way to manage the mxbai-embed-large model. Your llamafile server will handle all the generation. A quick check that the embedding model is actually present in Ollama follows after step 3.)

  3. Reload VS Code.
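
As noted above, embeddings still go through Ollama, so it's worth confirming the embedding model is actually present before relying on @codebase indexing. The commands below are a quick check; the exact model tag must match whatever your config references:

# List the models Ollama has locally and look for the embedder
ollama list | grep -i mxbai

# If nothing shows up, pull it (adjust the tag to match your config)
ollama pull mxbai-embed-large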

You are now 100% operational. When you type @codebase in Continue.dev:

  1. Continue.dev uses your Ollama mxbai-embed-large model to index your code.
  2. It finds the relevant code snippets.
  3. It sends the prompt and code context to your llamafile server at http://127.0.0.1:8080.
  4. Your llamafile server, running with locked RAM and full GPU/CPU acceleration, generates the code response using the high-fidelity gemma-q8 model (with the Rust LoRA, if you added it).

You have successfully combined the raw power of a natively-run llama.cpp engine with the simplicity of a llamafile server and the deep integration of Continue.dev.